polbin: Actually dump FlatGFA binary files #152

Merged: 7 commits, Mar 11, 2024

@sampsyo sampsyo commented Mar 11, 2024

Now the polbin binary can both read and write two formats: text GFA files and "FlatGFA" binary files. So these four commands are possible:

$ polbin < something.gfa  # round trip through in-memory FlatGFA, print GFA to stdout
$ polbin -o cool.flatgfa < something.gfa  # convert GFA to FlatGFA
$ polbin -i cool.flatgfa  # print a FlatGFA file out as plain ol GFA, to stdout
$ polbin -i cool.flatgfa -o ice_cold.flatgfa  # glorified `cp`, no reason to do this

I also added test environments to check both kinds of round-tripping (through in-memory FlatGFA and through an on-disk file). It all works!!!!!

$ turnt -j -e polbin_mem -e polbin_file *.gfa
1..16
ok 1 - DRB1-3123.gfa polbin_mem
ok 2 - DRB1-3123.gfa polbin_file
ok 3 - LPA.gfa polbin_mem
ok 4 - LPA.gfa polbin_file
ok 5 - chr6.C4.gfa polbin_mem
ok 6 - chr6.C4.gfa polbin_file
ok 7 - k.gfa polbin_mem
ok 8 - k.gfa polbin_file
ok 9 - note5.gfa polbin_mem
ok 10 - note5.gfa polbin_file
ok 11 - overlap.gfa polbin_mem
ok 12 - overlap.gfa polbin_file
ok 13 - q.chop.gfa polbin_mem
ok 14 - q.chop.gfa polbin_file
ok 15 - t.gfa polbin_mem
ok 16 - t.gfa polbin_file

Conversion seems to be decently fast on these small examples. For our go-to big example, chr8.pan.gfa (4.2 GB), one run of conversion on my rapidly aging Intel iMac took 1m8s for parsing (GFA -> FlatGFA) and 1m44s for pretty-printing (FlatGFA -> GFA). Seems within the ballpark of reasonableness? (Moreover, the GFA seems to have round-tripped successfully. FWIW, just running diff to check took 22s.)
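
To put those timings in throughput terms, here is a tiny back-of-the-envelope helper. It is only a sketch: the function name `throughput_mb_per_s` is made up for illustration, and it assumes the 4.2 GB figure is in decimal units (1 GB = 1000 MB).

```rust
/// Throughput in megabytes per second.
/// Assumption: decimal units, so 1 GB = 1000 MB.
fn throughput_mb_per_s(size_gb: f64, seconds: f64) -> f64 {
    size_gb * 1000.0 / seconds
}
```

Under that assumption, `throughput_mb_per_s(4.2, 68.0)` comes to roughly 62 MB/s for parsing (1m8s) and `throughput_mb_per_s(4.2, 104.0)` to roughly 40 MB/s for pretty-printing (1m44s).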

Maybe I went too far. But it does work!
`slice_from_prefix` returns None when the bytes are not aligned. Maybe
we should consider padding things in the binary format so that these
things can be aligned in the file!!!
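
The padding idea from the note above could look roughly like the following. This is a minimal sketch, not the actual FlatGFA layout: `padding_for` and `append_aligned` are hypothetical helpers, and the file is modeled as a plain `Vec<u8>`.

```rust
/// Number of zero bytes needed so that `offset` becomes a multiple of `align`.
/// Hypothetical helper; `align` must be nonzero (type alignments always are).
fn padding_for(offset: usize, align: usize) -> usize {
    assert!(align > 0);
    (align - offset % align) % align
}

/// Append `section` to `buf`, first inserting zero padding so that the
/// section begins at an offset that is a multiple of `align`.
/// Returns the offset at which the section starts.
fn append_aligned(buf: &mut Vec<u8>, section: &[u8], align: usize) -> usize {
    let pad = padding_for(buf.len(), align);
    buf.resize(buf.len() + pad, 0); // zero-filled padding bytes
    let start = buf.len();
    buf.extend_from_slice(section);
    start
}
```

For example, appending a 4-byte-aligned section to a 3-byte buffer inserts one padding byte, so the section starts at offset 4. Note that aligned offsets only yield aligned addresses if the buffer itself starts at a suitably aligned address, as a memory-mapped file does (mappings are page-aligned).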
@sampsyo sampsyo merged commit 60d001c into main Mar 11, 2024
3 checks passed
@sampsyo sampsyo deleted the polbin-dump branch March 11, 2024 23:27